Gpu improvements #66
Conversation
message("GPU_TYPE is ${GPU_TYPE} GPU_OFFLOAD is ${GPU_OFFLOAD}") | ||
if (GPU_TYPE STREQUAL v100 AND GPU_OFFLOAD STREQUAL OpenACC) | ||
string(APPEND GPUFLAGS " -acc -gpu=cc70,lineinfo,nofma -Minfo=accel ") | ||
endif() |
Can we issue an error message and abort compilation for any configuration other than V100 GPU with OpenACC? That way we won't accidentally compile the code with undesired flags.
Also, we could add the A100 GPU and OpenACC configuration here with the flags -acc -gpu=cc80,lineinfo,nofma -Minfo=accel.
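For reference, such a block would mirror the existing v100 branch (a sketch; the same block in fact shows up in a later revision of nvhpc.cmake in this PR):

if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma -Minfo=accel ")
endif()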
I disagree with adding an error message and exiting: we want users to be able to expand the list of supported compilers, machines, and offload options. I can look into adding a warning instead.
Yes, we can add the a100 options.
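A minimal sketch of what that warning could look like after the GPU flag selection (hypothetical; it assumes GPUFLAGS is still empty when no branch matched):

if (GPU_TYPE AND NOT GPUFLAGS)
  message(WARNING "No GPU flags defined for GPU_TYPE=${GPU_TYPE} with GPU_OFFLOAD=${GPU_OFFLOAD}; building without GPU offload flags.")
endif()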
My concern is that if we specify, say, V100 GPU with OpenMP offload here, the code will still compile but no GPU flags will actually be applied, which is confusing. Adding an error message would force the user to supply the correct options before the code can be compiled. What do you think?
I think that I can enforce this by making the MAX_GPUS_PER_NODE variable in config_machines.xml compiler dependent. How does that sound?
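A sketch of what compiler-dependent entries might look like, using the attribute-selection convention config_machines.xml already uses elsewhere (the counts here are placeholders):

<MAX_GPUS_PER_NODE compiler="nvhpc">4</MAX_GPUS_PER_NODE>
<MAX_GPUS_PER_NODE compiler="intel">0</MAX_GPUS_PER_NODE>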
Oh, I did not realize that you had provided the full list of combinations for OpenACC and OpenMP offload. In that case my concern is unnecessary.
machines/config_batch.xml (outdated)
 </directives>

-<directives queue="casper" compiler="nvhpc-gpu">
+<directives queue="casper" compiler="nvhpc">
   <!-- Turn on MPS server manually -->
   <!-- This is a temporary solution and should be removed once MPS is integrated into PBS on Casper -->
   <directive default="/bin/bash" > -S /glade/u/apps/dav/opt/nvidia-mps/mps_bash </directive>
CISL has integrated MPS into the PBS scheduler, so we can turn on MPS by adding mps=1 to the PBS resource request -l select=... on the next line. This temporary solution is no longer needed.
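For illustration, the select line might then look something like this (node, CPU, and GPU counts are placeholders):

-l select=1:ncpus=36:mpiprocs=36:ngpus=4:mps=1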
Can mps=1 be added to the nvhpc select line regardless of whether GPUs are actually used, or should it only be there for GPU cases?
Great question, and I do not know. For safety, can we add the ngpus and mps options only when the GPU_OFFLOAD method is not none?
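One way to express that, sketched with the gpu_enabled directive attribute that appears later in this PR (resource counts are placeholders):

<directives queue="casper" compiler="nvhpc" gpu_enabled="true">
  <!-- request GPUs and MPS only for GPU-enabled cases -->
  <directive> -l select=1:ncpus=36:mpiprocs=36:ngpus=4:mps=1 </directive>
</directives>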
And we have the ngpus_per_node option for the create_newcase script. That could also be used to determine whether a job should be submitted to a GPU queue, and which PBS resources should be selected later.
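For reference, a hypothetical invocation (the exact flag spelling exposed by create_newcase may differ; the case name, grid, and compset here are placeholders):

./create_newcase --case mygpucase --res f19_g17 --compset X --machine casper --ngpus-per-node 4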
machines/config_machines.xml (outdated)
@@ -478,15 +479,17 @@ This allows using a different mpirun command to launch unit tests
  <command name="load">openmpi/4.1.0</command>
  <command name="load">netcdf-mpi/4.7.4</command>
  <command name="load">pnetcdf/1.12.2</command>
</modules>
<modules mpilib="openmpi" compiler="nvhpc" gpu_offload="OpenACC">
  <command name="load">cuda/11.0.3</command>
I later realized that the nvhpc module already comes with CUDA, so we probably do not need to load CUDA manually unless the newer CUDA version has a problem. May be worth testing.
Already tested: on Casper the cuda module seems to be required even though it shouldn't be.
Aha, I see. In that case, can we upgrade the CUDA version to 11.6, since that is the version that comes with nvhpc/22.2?
machines/config_machines.xml (outdated)
<command name="load">netcdf-mpi/4.8.1</command> | ||
<command name="load">pnetcdf/1.12.2</command> | ||
<command name="load">pnetcdf/1.12.3</command> | ||
</modules> | ||
<modules mpilib="mpi-serial" compiler="nvhpc"> | ||
<command name="load">netcdf/4.8.1</command> |
Shall we delete lines 497 to 505?
machines/config_machines.xml (outdated)
<modules compiler="nvhpc" mpilib="openmpi" DEBUG="FALSE"> | ||
<command name="use">/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgs/nvhpc/22.2/</command> | ||
<command name="load">esmf-8.4.0b08_casper-ncdfio-openmpi-O</command> | ||
</modules> | ||
<modules compiler="pgi" mpilib="openmpi" DEBUG="TRUE"> |
Shall we delete the PGI-related settings for Casper now?
@@ -417,6 +417,7 @@ This allows using a different mpirun command to launch unit tests
 <DIN_LOC_ROOT_CLMFORC>/glade/p/cgd/tss/CTSM_datm_forcing_data</DIN_LOC_ROOT_CLMFORC>
 <DOUT_S_ROOT>$CIME_OUTPUT_ROOT/archive/$CASE</DOUT_S_ROOT>
 <BASELINE_ROOT>$ENV{CESMDATAROOT}/cesm_baselines</BASELINE_ROOT>
+<CCSM_CPRNC>$ENV{CESMDATAROOT}/tools/cime/tools/cprnc/cprnc</CCSM_CPRNC>
Remove pgi and *-gpu at line 413.
I know I haven't completed the cleanup here yet.
…p and combined options
machines/cmake_macros/nvhpc.cmake (outdated)
if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma -Minfo=accel ")
endif()
if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
Replace OpenACC with OpenMP.
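That is, the duplicated block should presumably become something like this (the -mp=gpu OpenMP offload flag is an assumption based on nvhpc conventions; the review comment itself only fixes the condition):

if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenMP)
  string(APPEND GPUFLAGS " -mp=gpu -gpu=cc80,lineinfo,nofma -Minfo=mp ")
endif()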
Use nvhpc/22.11, cray-mpich/8.1.21, and esmf/8.4.1.b02
  modified: machines/cmake_macros/gust.cmake
  modified: machines/config_machines.xml
…_config_cesm/compare/52c06b3..fbc05d6 Update settings on Gust and Casper
  modified: Depends.nvhpc
  deleted: Depends.nvhpc-gpu
  deleted: cmake_macros/nvhpc-gpu.cmake
  deleted: cmake_macros/nvhpc-gpu_casper.cmake
  modified: cmake_macros/nvhpc.cmake
  modified: cmake_macros/nvhpc_casper.cmake
  deleted: cmake_macros/pgi-gpu.cmake
  deleted: cmake_macros/pgi-gpu_casper.cmake
  modified: config_batch.xml
  modified: config_machines.xml
  deleted: mpi_run_gpu.casper
…asper and Gust
  modified: machines/config_machines.xml
modified: config_batch.xml
modified: machines/config_machines.xml
modified: machines/config_machines.xml
modified: machines/config_machines.xml
modified: machines/config_batch.xml
load cuda module on Gust for a GPU run
  deleted: cmake_macros/nvhpc_gust.cmake
  modified: config_machines.xml
modified: machines/config_machines.xml
modified: config_batch.xml
modified: machines/config_batch.xml
modified: machines/config_machines.xml
modified: config_machines.xml
modified: machines/config_machines.xml
Thanks @jedwards4b for issuing this PR. It generally looks good to me, but I have some clarifications / change requests before approving it.
machines/config_machines.xml (outdated)
<modules mpilib="openmpi" compiler="nvhpc"> | ||
<command name="load">cuda/11.6</command> |
Shall we only load the CUDA module when gpu_offload is not None? Or skip it when gpu_enabled is False?
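Mirroring the gpu_offload-qualified modules block that appears earlier in this file, a sketch could be (one such block per supported offload method):

<modules mpilib="openmpi" compiler="nvhpc" gpu_offload="OpenACC">
  <command name="load">cuda/11.6</command>
</modules>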
<MAX_MPITASKS_PER_NODE>36</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>36</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>v100,a100</GPU_TYPES>
Shall we specify a gpu_offload variable as well? For machines with AMD GPUs, openacc and combined may not be valid inputs.
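For illustration only, such an entry might follow the pattern of the existing GPU_TYPES line (GPU_OFFLOADS is a hypothetical variable name, not something defined in this PR):

<GPU_TYPES>v100,a100</GPU_TYPES>
<GPU_OFFLOADS>openacc,openmp,combined</GPU_OFFLOADS>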
How do we handle Fortran do concurrent here?
This is a great question! In this case, combined may be a confusing word, too. Is there any plan to introduce do concurrent to CAM in the future?
<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>64</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>a100</GPU_TYPES>
The same question about the gpu_offload variable.
<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>64</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>a100</GPU_TYPES>
<PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
<mpirun mpilib="default">
  <executable>mpiexec</executable>
At line 1838, shall we do a purge first before loading any modules?
This is a feature of modules on Gust (and soon Derecho): the two modules loaded above the purge are sticky and are not affected by the purge command.
Thanks for the clarification. That is clear to me now.
<submit_args>
  <argument> -l gpu_type=$GPU_TYPE </argument>
</submit_args>
<directives queue="casper" compiler="nvhpc" gpu_enabled="true">
I know this works, but I am not sure how gpu_enabled actually works. Is it an XML variable defined somewhere? And how is it set to True or False during the build? A brief explanation would be very helpful.
This is done here: https://github.com/jedwards4b/cime/blob/add_gpu_gust/CIME/case/case.py#L457
gpu_enabled is an attribute of the case object and is set to true if GPU_TYPE is set to a valid value.
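A rough sketch of that logic (a hypothetical simplification for illustration; the actual implementation is in the case.py link above):

# Hypothetical simplification of how the case object derives its
# gpu_enabled attribute from the GPU_TYPE XML variable.
class Case:
    def __init__(self):
        self.gpu_enabled = False

    def initialize_derived_attributes(self):
        gpu_type = self.get_value("GPU_TYPE")
        # any valid GPU type other than "none" enables GPU support
        if gpu_type is not None and gpu_type != "none":
            self.gpu_enabled = True

    def get_value(self, name):
        # placeholder for the real XML variable lookup
        return None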
Thanks @jedwards4b for the details. That is very helpful!
Remove special definitions for gpu enabled compilers. WIP